Remarks:

The applicants have corrected claims 1, 11, and 21 as suggested by the
Examiner, and also corrected claim 20. The undersigned's review of the claims
showed that some original claims had been noted as "Previously Amended" in
the last amendment, and the correct original notation has been used again for
those claims not amended.

Reconsideration after this amendment is requested, and the examiner is
respectfully traversed with respect to the rejections under 35 USC 103. The
rejection assumes a teaching to one of ordinary skill of the art based on
assumptions which are not factual, and therefore the rejection is considered
without merit.

A declaration of an inventor, Brian Robert Prasky, together with the
attachments to the declaration, is appended, setting forth reasons why the
examiner's basic assumptions are wrong.

Reconsideration and allowance is now requested. It is respectfully
submitted that the application should be in final condition for
allowance, which is respectfully requested.

RESPECTFULLY SUBMITTED
(For the inventors) 

BY: Lynn L. Augspurger
Registration no. 24,227

Phone: 845-433-1174 
Fax: 845-432-9601 

P.S. Next after these remarks follows the paper correlating the BTAC and BTB
nomenclature, and then follows the declaration of the inventor, followed by
webpages showing the references referenced in the declaration.
1 Introduction



Excellent branch handling techniques are essential for
current and future microprocessors. Many instructions are in
different stages in the pipeline of a wide-issue superscalar
processor. Instruction issue also works best with a large
instruction window, leading to even more instructions that
are 'in flight' in the pipeline. However, approximately every
seventh instruction in an instruction stream is a branch
instruction which potentially interrupts the instruction flow
through the pipeline.

The task of high performance branch handling consists of
the following requirements:

1 an early determination of the branch outcome (the
so-called branch resolution)

2 buffering of the branch target address in
a branch target address cache after its first
calculation and an immediate reload of the
program counter after a branch target address
cache match
Dynamic branch prediction and control speculation

3 an excellent branch predictor (i.e. branch prediction
technique) and speculative execution mechanism

4 often another branch is predicted while a previous
branch is still unresolved, so the processor must be able
to pursue two or more speculation levels and

5 an efficient rerolling mechanism when a branch is
mispredicted (minimising the branch misprediction
penalty).

An early branch resolution is supported by forwarding
as soon as possible to the branch instruction the
results of compare instructions that may be stored in a
general-purpose register or in a special condition-code
register. Branch testing could be moved forward in the
pipeline as far as the instruction decode stage. Previous
calculations of branch target addresses are cached in a branch
target address cache and accessed during the instruction fetch
stage.

The performance of branch prediction depends on the
prediction accuracy and cost of misprediction. Prediction
accuracy can be improved by inventing better branch
predictors, but the misprediction penalty depends on many
organisational features: the pipeline length (favouring shorter
over longer pipelines), the overall organisation of the
pipeline, and whether misspeculated instructions can be removed
from internal buffers or have to be executed and can only be
removed in the retire stage.

2 Static branch prediction 

Static branch prediction is a simple prediction technique
which either always uses a fixed prediction direction or
allows the compiler to determine the prediction direction. The
prediction direction of a branch instruction is never changed.
Simple hardware-fixed direction mechanisms can be:

Predict always not taken: this is the simplest scheme
because the assumption is a straight instruction flow.
Unfortunately, due to frequent loops in an instruction flow,
this technique is not very effective. This prediction technique
should not be confused with the delayed branch technique
(a popular technique in the first generations of scalar RISC
processors). The instruction in the delay slot is always
executed, while the predict-not-taken technique executes the
instructions after the branch speculatively and squashes the
instruction execution in the case of misprediction.

Predict always taken: here branches at the end of a loop
iteration are correctly predicted as long as the loop loops. The
branch target address has to be stored within the instruction
fetch unit to allow a zero delay.

Backward branch predict taken, forward branch predict
not taken: here the idea is that branches with branch target
addresses pointing backwards stem from loops and should be
predicted taken, while other kinds of branches are preferably
not taken.
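The three hardware-fixed schemes above can be sketched in a few lines. This is only an illustration, not taken from the paper: the function name predict_static, the scheme labels and the example addresses are hypothetical, and a branch is assumed to be 'backward' when its target address is lower than its own address.

```python
def predict_static(scheme: str, branch_addr: int, target_addr: int) -> bool:
    """Return True if the branch is predicted taken under a fixed scheme."""
    if scheme == "always_not_taken":
        return False
    if scheme == "always_taken":
        return True
    if scheme == "btfn":
        # Backward branch predict taken, forward branch predict not taken.
        return target_addr < branch_addr
    raise ValueError(f"unknown scheme: {scheme}")


# A loop-closing branch at 0x120 jumping back to 0x100, and a forward skip.
print(predict_static("btfn", 0x120, 0x100))  # backward branch
print(predict_static("btfn", 0x120, 0x140))  # forward branch
```

The prediction never changes at run time, which is what distinguishes these schemes from the dynamic predictors discussed later.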

Sometimes a bit in the branch opcode allows the compiler
to decide the predict direction either directly (bit set means
'predict taken', bit not set means 'predict not taken') or by
reversing the hardware-determined direction.

The compiler may use several techniques for a good
compiler-based static prediction. It may either:

1 examine the program structure for prediction (branches
at the end of loop iteration code should be predicted as
taken, if-then branches predicted as not taken)

2 relegate prediction to the programmer by compiler
directives or

3 use a profile-based prediction by predicting the branch
directions based on prior runs of the program with
recording of the branch behaviour.

The profile-based prediction is nearly always better than the
simpler direction-based predictions.

3 Dynamic branch prediction 

In a dynamic branch prediction scheme, prediction is
decided on the computation history of the program. After
a start-up phase of the program execution, where a static
branch prediction might be effective, historical information
is gathered and dynamic branch prediction becomes more
effective. In general, dynamic gives better results than static
branch prediction, but at the cost of increased hardware
complexity.

3.1 Branch-target buffer

The branch target address is needed at the same time as
the prediction. In particular, it should be known already
in the IF stage whether the as-yet-undecoded instruction
is a (conditional or unconditional) branch to allow an
instruction fetch at the target address in the next cycle. The
Branch-Target Buffer (BTB) is a branch-prediction cache
that stores the predicted address for the next instruction
after a branch (Lee and Smith, 1984). The BTB is
accessed during the IF stage. It consists of a table
with branch addresses, the corresponding target addresses,
and prediction information (see Figure 1 for a sample
BTB). The PC for the next instruction to fetch is compared
with the entries in the BTB. If a matching entry is found in
the BTB, fetching can start immediately at the target address.
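The lookup described above can be sketched as follows. This is a simplified model under stated assumptions, not the organisation of any particular processor: the class name BranchTargetBuffer is hypothetical, the table is modelled as a fully associative dictionary, the prediction information is reduced to a single 'predict taken' flag, and instructions are assumed to be 4 bytes long.

```python
class BranchTargetBuffer:
    """Toy BTB: branch address -> (target address, predict-taken flag)."""

    def __init__(self):
        self.entries = {}

    def update(self, branch_addr, target_addr, taken):
        # Record the target and the latest direction once the branch resolves.
        self.entries[branch_addr] = (target_addr, taken)

    def next_fetch(self, pc, instr_size=4):
        # Compare the fetch PC with the stored branch addresses.
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]          # hit and predicted taken: fetch at target
        return pc + instr_size       # miss or predicted not taken: fall through


btb = BranchTargetBuffer()
btb.update(0x120, 0x100, taken=True)
print(hex(btb.next_fetch(0x120)))    # 0x100
print(hex(btb.next_fetch(0x124)))    # 0x128
```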

Figure 1 Branch-target buffer (columns: Branch address, Target address)



The BTB stores branch and jump target addresses. Branch
target addresses are predicted addresses, while jump target
addresses always transfer control. Jumps (unconditional
branches) are usually much less frequent than conditional
branches.

Fetching instructions from a new target address is fast if the
fetch address hits in the I-cache. Moreover, for procedure
calls and returns a small stack of return addresses is often
used in addition to, and independent of, a BTB. Such a return
address stack appears, for example, in the Alpha 21164,
organised as a 12-entry circular buffer that makes the last
12 return addresses available.
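A return address stack organised as a circular buffer can be sketched as below. The class name ReturnAddressStack and its methods are hypothetical, and the behaviour on overflow (the oldest entry is silently overwritten) is an assumption of this sketch rather than a documented detail of the Alpha 21164.

```python
class ReturnAddressStack:
    """Circular buffer that keeps only the most recent return addresses."""

    def __init__(self, size=12):
        self.slots = [None] * size
        self.top = 0                      # number of pushes not yet popped

    def push(self, return_addr):
        # On a call: store the return address, wrapping around when full.
        self.slots[self.top % len(self.slots)] = return_addr
        self.top += 1

    def pop(self):
        # On a return: predict the most recently pushed address, if any.
        if self.top == 0:
            return None
        self.top -= 1
        idx = self.top % len(self.slots)
        addr = self.slots[idx]
        self.slots[idx] = None            # an overwritten entry is lost for good
        return addr


ras = ReturnAddressStack()
for depth in range(13):                   # 13 nested calls, one more than fits
    ras.push(0x1000 + 4 * depth)
print(hex(ras.pop()))                     # 0x1030, the innermost return address
```

With 13 nested calls only the last 12 return addresses can be predicted; the return from the outermost call finds an empty slot.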

3.2 One-bit predictor

The simplest dynamic branch prediction scheme is a simple
Branch History Table (BHT). The BHT is a small buffer
memory containing branch addresses indexed by the lower
bits of the address of a branch instruction. Each entry of the
BHT contains one bit that indicates whether the branch was
recently taken or not. If the bit is set, the branch is predicted
taken. If the bit is not set, the branch is predicted not taken.
In the case of a misprediction, the bit state is reversed and so
is the prediction direction.

One-bit predictors can also be implemented in the BTB by
only storing the target addresses of predicted taken branches.

The prediction states of a 1-bit predictor are shown in
Figure 2 (T stands for taken and NT stands for not taken).

Figure 2 One-bit predictor states

Such a 1-bit predictor correctly predicts a branch at the end of
a loop iteration, as long as the loop does not exit. However,
in nested loops, a 1-bit prediction scheme will cause two
mispredictions for the inner loop: one at the end of the loop,
when the iteration exits the loop instead of looping again, and
one when executing the first loop iteration, when it predicts
exit instead of looping. Such a double misprediction in nested
loops is avoided by a 2-bit predictor scheme.
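The double misprediction can be reproduced with a small simulation. The sketch below is illustrative only: the function names are hypothetical, a single predictor bit is assumed for the loop-closing branch of the inner loop, and that bit is assumed to start in the 'taken' state.

```python
def one_bit_mispredictions(outcomes, initial_taken=True):
    """Count mispredictions of a 1-bit predictor over a stream of outcomes."""
    state = initial_taken
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
            state = taken            # the bit is reversed on a misprediction
    return misses


def inner_loop_branch(inner_iters, outer_iters):
    """Outcomes of the inner loop-closing branch: taken until the loop exits."""
    return ([True] * (inner_iters - 1) + [False]) * outer_iters


# 4 inner iterations, executed 3 times by the outer loop.
print(one_bit_mispredictions(inner_loop_branch(4, 3)))  # 5
```

The first execution of the inner loop costs one misprediction (the exit); every later execution costs two (the first iteration and the exit), giving 1 + 2 + 2 = 5.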

Two-bit predictors

In a 2-bit prediction scheme 2 bits instead of one are
assigned to each entry in the BHT. The 2 bits stand for
the prediction states 'predict strongly taken', 'predict
weakly taken', 'predict strongly not taken' and 'predict weakly
not taken'. In the case of a misprediction in the 'strongly'
state cases, the prediction direction is not changed; rather
the prediction goes into the respective 'weakly' state.
A prediction must miss twice before it is changed when a
2-bit prediction scheme is applied.

Two kinds of 2-bit prediction schemes are used: the
saturation up-down counter scheme demonstrated in Figure 3
and the scheme given in Figure 4.

Figure 3 Two-bit predictor saturation counter states




In the 2-bit saturation up-down counter scheme, the counter
is incremented for each taken branch occurrence and
decremented each time the branch is not taken. The counter
is saturating, that is, it is not decremented past 0, nor is it
incremented past 3. The most significant bit determines the
prediction.
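A saturating up-down counter is small enough to write out in full. In this sketch the class name SaturatingCounter2Bit is hypothetical, and the encoding of the states (0 = strongly not taken, 1 = weakly not taken, 2 = weakly taken, 3 = strongly taken) is an assumption chosen so that the most significant bit gives the prediction.

```python
class SaturatingCounter2Bit:
    """2-bit saturating up-down counter; states 0..3."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2       # most significant bit set means 'taken'

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # not incremented past 3
        else:
            self.state = max(0, self.state - 1)   # not decremented past 0


def count_misses(outcomes, state=3):
    counter = SaturatingCounter2Bit(state)
    misses = 0
    for taken in outcomes:
        misses += counter.predict() != taken
        counter.update(taken)
    return misses


# Inner loop-closing branch: 4 iterations, executed 3 times by an outer loop.
stream = ([True] * 3 + [False]) * 3
print(count_misses(stream))          # 3
```

On this nested-loop stream the counter misses only once per loop exit, because a single not-taken outcome moves it from 'strongly taken' to 'weakly taken' without changing the predicted direction.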



Figure 4 Two-bit predictor states

The other scheme given in Figure 4 differs from the saturation
up-down counter scheme by changing directly from the
'weakly' to the 'strongly' states in the case of a second
misprediction. This scheme is applied in the UltraSPARC-I
processor. Branches without prediction are initialised by
the UltraSPARC-I processor to 'predict weakly not taken'
(Tremblay and O'Connor, 1996).

Hennessy and Patterson (1996) showed that the
mispredictions of SPEC89 programs vary from 1% (nasa7,
tomcatv) to 18% (eqntott), with spice at 9% and gcc
at 12%, assuming a 4096-entry BHT.

The 1-bit prediction scheme is extendable to an n-bit
scheme. However, studies have shown that a 2-bit prediction
scheme does almost as well as an n-bit scheme with n > 2.

Two-bit predictors can be implemented in the BTB,
assigning two state bits to each entry in the BTB. Another
solution, which is proposed for the PowerPC 604 and 620, is
to use a BTB for target addresses and a BHT as a separate
prediction buffer. While the BTB is accessed in the IF stage,
the BHT prediction is performed in the PowerPC 604 and 620
one cycle later in the ID stage and may override a previous
BTB prediction.

A mispredict in the BHT occurs for two reasons: either a
wrong guess for that branch or the branch history of a wrong
branch is used because of the way the table is indexed. In an
indexed table lookup, part of the instruction address is used
as an 'index' to identify a table entry. Instruction addresses
with the same bit pattern used as an index share the same table
entry, leading to frequent mispredicts if the table is small.

Two-bit predictors work well for scientific floating-point
intensive programs which contain many frequently
executed loop-control branches. Shortcomings of the
2-bit prediction schemes arise from dependent (correlated)
branches, which are frequent in integer-dominated
programs.

The following example of two branches, one dependent
on the other, demonstrates that 1-bit and 2-bit predictors can
potentially mispredict every time. Let us look at the following
program (Hennessy and Patterson, 1996):

    if (d == 0)      /* branch b1 */
        d = 1;
    if (d == 1)      /* branch b2 */



In assembly language notation the program can be given as
follows (variable d is assigned to register R1):

        bnez  R1, L1        ; branch b1 (d != 0)
        addi  R1, R0, #1    ; d == 0, so d = 1
    L1: subi  R3, R1, #1

        bnez  R3, L2        ; branch b2 (d != 1)

    L2:




Consider a sequence where d alternates between 0 and 2,
which generates a sequence of NT-T-NT-T-NT-T for branches
b1 and b2. The execution behaviour is given in the following
table.



Initial d   d == 0?   b1   d before b2   d == 1?   b2

0           yes       NT   1             yes       NT

2           no        T    2             no        T
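The table can be replayed in code to see how simple predictors behave on this sequence. The sketch below is an illustration under stated assumptions: the class and function names are hypothetical, each branch gets its own predictor, the 1-bit predictor starts at 'predict taken', and the 2-bit saturating counter starts at 'predict weakly taken' (encoded here as 2 on a 0..3 scale).

```python
class OneBit:
    def __init__(self):
        self.taken = True            # initialised to 'predict taken'

    def predict(self):
        return self.taken

    def update(self, taken):
        self.taken = taken           # remember only the last outcome


class TwoBitSaturating:
    def __init__(self):
        self.counter = 2             # 2 = 'predict weakly taken'

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        step = 1 if taken else -1
        self.counter = max(0, min(3, self.counter + step))


def replay(d_values, predictor_class):
    """Run b1 and b2 for each initial d; return (mispredictions, executions)."""
    predictors = {"b1": predictor_class(), "b2": predictor_class()}
    misses = executed = 0
    for d in d_values:
        b1_taken = d != 0            # bnez R1, L1
        if not b1_taken:
            d = 1                    # fall-through path sets d = 1
        b2_taken = d != 1            # bnez R3, L2
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            predictor = predictors[name]
            misses += predictor.predict() != taken
            executed += 1
            predictor.update(taken)
    return misses, executed


d_sequence = [0, 2, 0, 2, 0, 2]
print(replay(d_sequence, OneBit))            # (12, 12)
print(replay(d_sequence, TwoBitSaturating))  # (12, 12)
```

Under these starting states both predictors mispredict every execution of b1 and b2, because each branch strictly alternates between not taken and taken.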
If we apply a 1-bit predictor which is initialised to 'predict
taken' for branches b1 and b2, then every branch is
mispredicted. The same behaviour is shown for the 2-bit
predictor of Figure 3 starting from the state 'predict weakly
taken'. The 2-bit predictor of Figure 4 mispredicts every
second branch execution of b1 and b2. A (1, 1) correlating
predictor (see below) can take advantage of the correlation
between the two branches: it mispredicts only in the first
iteration when d = 2.

Correlating branch predictors usually reach higher
prediction rates for integer-intensive programs than the
2-bit predictor scheme and require only a small increase in
hardware cost.

3.3 Correlation-based predictors

The 2-bit predictor scheme only uses the recent behaviour
of a single branch to predict the future of that branch.
Correlations between different branch instructions are not
taken into account. Let us also look at the recent behaviour
of other branches rather than just the branch we are trying to
predict.

The so-called correlation-based predictors developed by
Pan et al. (1992) are branch predictors that additionally use
the behaviour of other branches to make a prediction. While
2-bit predictors use self-history only, the correlating predictor
uses neighbour history as well. Many integer workloads
feature complex control-flows whereof the outcome of a
branch is affected by the outcomes of recently executed
branches. In other words, the branches are correlated
(Pan et al., 1992).

A correlation-based predictor, denoted in short as an
(m, n)-predictor, uses the behaviour of the last m branches
to choose from 2^m branch predictors, each of which is an
n-bit predictor for a single branch. The global history of
the most recent m branches can be recorded in an m-bit shift
register - called a Branch History Register (BHR) - where
each bit records whether the branch was taken or not taken.
Each time a branch in execution resolves, its sign bit is
shifted into the BHR. The contents of the BHR are used to
address (index) the entries in a so-called Pattern History Table
(PHT). Typically 2-bit predictors are used in PHTs.

A (1, 1)-predictor uses the behaviour of the last branch
to choose between a pair of 1-bit predictors, and a
correlation-based predictor denoted as a (2, 2)-predictor uses
a BHR of 2 bits to choose among four 2-bit prediction tables.
A 2-bit predictor (without global history) can simply be
denoted as a (0, 2)-predictor.

Figure 5 shows the implementation of a correlation-based
predictor, a type (2, 2)-predictor with four 1K-entry PHTs.
The BHR bit pattern selects the specific PHT. The entries
of the 1K-entry PHTs are generally accessed by using the
lower order 10 bits of the branch address. Depending on
the implementation, the PHTs may alternatively be accessed
using 10 bits of the address of the instruction immediately
prior to the branch under consideration (Pan et al., 1992).
The four 1K-entry PHTs can also be viewed as a single
4K-entry PHT. Then 12 bits are required for the PHT lookup.
Therefore, 2 bits from the BHR are concatenated with 10 bits
from the branch address.

Figure 5 Implementation of a (2, 2)-predictor



3.4 Two-level adaptive predictors

The two-level adaptive predictor was developed by Yeh
and Patt (1992) at the same time as the closely related
correlation-based prediction scheme. There are several
variations of the two-level adaptive prediction scheme (Yeh
and Patt, 1993).

The basic two-level predictor uses a single 'global' BHR
of k bits to index in a PHT of 2-bit counters. The global
BHR is updated with outcomes from all branches. Thus, not
only the history of a branch, but also the history of other
branches, influences the prediction of the branch. All schemes
that use a single global BHR are called global history schemes
and correspond to Pan et al.'s correlation-based predictor
schemes.

In the simplest case there is a single global BHR
(denoted G) and a single global PHT (denoted g); this
simple predictor is called GAg (A stands for 'adaptive'). All
PHT implementations of Yeh and Patt use 2-bit predictors.
An implementation of a GAg-predictor with a 4-bit BHR
length, therefore also denoted as GAg(4), is shown in
Figure 6. The BHR is implemented as a simple shift register
shifting right to left with the sign (1 for branch taken, 0 for
branch not taken) of the last resolved branch at the rightmost
bit position.
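A GAg(k) predictor can be sketched as a shift register plus one table of 2-bit counters. The sketch is an illustration only: the class name GAgPredictor is hypothetical, the counters are assumed to be saturating up-down counters on a 0..3 scale, and they are assumed to start at 'weakly taken' (2).

```python
class GAgPredictor:
    """Single global BHR of k bits indexing a single global PHT."""

    def __init__(self, k=4):
        self.k = k
        self.bhr = 0                         # global branch history register
        self.pht = [2] * (1 << k)            # 2-bit counters, weakly taken

    def predict(self):
        # The lookup depends only on the history pattern, not on the address.
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        counter = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift right to left; the newest outcome enters at the rightmost bit.
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.k) - 1)


predictor = GAgPredictor(k=4)
hits = []
for i in range(60):
    taken = i % 2 == 0                       # strictly alternating branch
    hits.append(predictor.predict() == taken)
    predictor.update(taken)
print(all(hits[-20:]))                       # True once the history has warmed up
```

An alternating branch defeats a plain 2-bit counter, but here the two recurring history patterns select different PHT entries, so each entry sees a consistent outcome.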

Figure 6 Implementation of a GAg(4)-predictor

In the GAg predictor scheme the PHT lookup depends
entirely on the bit pattern in the BHR and is completely
independent of the branch address. The advantages of the
'degenerate' GAg scheme are its simple implementation and
the fact that the predicted outcome of a branch can be known
long before the execution of that branch (Pan et al., 1992).

A simple GAg(4)-predictor often performs better on
integer programs than a 2-bit predictor (with a saturation
up-down counter scheme).

However, GAg-predictors still suffer from branch patterns
that emerge several times within a computation. Two code
sequences may have the same bit pattern in the BHR and
thus index the same pattern in the PHT. Since the branch
behaviour of the two code sequences may differ, the shared
pattern may lead to wrong predictions.

Such wrong predictions can be restrained by additionally
using:

1 the (full) branch address to distinguish multiple PHTs
(called per-address PHTs)

2 a subset of branches (defined by part of the
branch address) to distinguish multiple PHTs
(called per-set PHTs)

3 the (full) branch address to distinguish multiple
BHRs (called per-address BHRs)

4 a subset of branches to distinguish multiple BHRs
(called per-set BHRs) or

5 a combination scheme.

In the first two cases, a single global BHR is combined with
multiple per-address selected PHTs, denoted as GAp, or with
multiple per-set addressed PHTs, denoted as GAs. A GAp
predictor with a 4-bit BHR, denoted as GAp(4), is shown in
Figure 7 and a GAs predictor with a 4-bit BHR, denoted as
GAs(4, 2^n), is shown in Figure 8. In the GAs(4, 2^n) predictor
n bits of the branch address are used to define 2^n different
branch sets corresponding to 2^n PHTs with 2^4 entries each.
Branches of the same branch set share the same PHT in a
GAs predictor.

Figure 7 Implementation of a GAp(4) predictor

In fact, by rotating Figure 5 90° to the right and
assuming a 4-bit BHR, it can be seen that a correlation-based
(4, 2)-predictor is equivalent to a GAs(4) predictor, assuming
n = 10 bits in the branch address (compare with Figure 8).

A second scheme class is defined as the per-address
history schemes where the first-level branch history refers
to the last k occurrences of the same branch instruction
(using self-history only). Therefore, a BHR is associated
with each branch instruction to distinguish the branch history
information of each branch. The BHRs record self-history
in contrast to the neighbour-history recording BHR used in
global history schemes. The per-address BHRs are combined
in a table which is called the Per-address Branch History Table
(PBHT) by Yeh and Patt.

In the simplest per-address history scheme, the BHRs
index into a single global PHT. Such a two-level adaptive
predictor is denoted as PAg (multiple per-address indexed
BHRs and a single global PHT). An implementation of a
PAg(4) predictor is shown in Figure 9. Two different branches
with the same BHT bit pattern select the same PHT entry,
leading to unnecessary misprediction.



Figure 9 Implementation of a PAg(4) predictor

The combination of multiple per-address BHRs with multiple
per-address PHTs, denoted as a PAp predictor, and of multiple
per-address BHRs with multiple per-set PHTs, denoted as
a PAs predictor, is also possible. In the PAp scheme each
branch has its own BHR and its own PHT. So the number
of BHRs in the per-address BHT and the number of PHTs
is equal. However, the numbers are not fixed. They depend
on the number of branches in the program. Conceptually,
the BHR content is used as an index to select an entry in its
PHT. The PHT is selected by the branch instruction address
(PAp) or by the branch set (PAs). An implementation of a
PAp(4) predictor is shown in Figure 10. The figure shows
the case of two branches with the same BHT bit pattern that
index the same line in the per-address PHTs. However,
the branch addresses select different PHTs and thus different
PHT entries.

Figure 10 Implementation of a PAp(4) predictor

The three two-level adaptive predictors GAg, GAp and GAs
use a single global BHR and together form the global history
scheme predictors. These predictors are closely related to the
correlation-based predictor.



In the per-address history schemes only the execution history
of the branch itself has an effect on its prediction. The branch
prediction is non-correlating - independent of the execution
history of other branches.

In the per-set history schemes the first-level branch history
means the last k occurrences of the branch instructions
from the same subset. Each BHR is associated with a set
of branches. The set attributes of a branch can be determined
by the branch opcode, the branch class which is assigned
by the compiler or by part of the branch address. Since a
per-set addressed BHR is potentially updated with history
from all branches in the same set, the prediction of a branch
is influenced by other branches in the same set (Yeh and
Patt, 1993). Again the three variations are determined by the
variations in the organisation of the second level, namely
SAg, SAs and SAp. Implementations of a SAg(4) and
a SAs(4) predictor are shown in Figures 11 and 12. Figure 11
shows that the SAg-predictor may suffer from branch patterns
that emerge several times within a computation (the same bit
patterns in the BHRs select the same PHT entry in the global
PHT). Moreover, in all per-set history schemes, branches
which fall into the same set (e.g. having the same n bits in
the branch address) select the same entry in the BHT (and/or
the same PHT). This is demonstrated in Figure 12.

Figure 11 Implementation of a SAg(4) predictor

Figure 12 Implementation of a SAs(4) predictor

The full list of Yeh and Patt's two-level adaptive branch
predictors is given as follows (Yeh and Patt, 1993):

• GAg: global BHR, global PHT

• GAs: global BHR, per-set PHTs

• GAp: global BHR, per-address PHTs

• PAg: per-address BHT, global PHT

• PAs: per-address BHT, per-set PHTs

• PAp: per-address BHT, per-address PHTs

• SAg: per-set BHT, global PHT

• SAs: per-set BHT, per-set PHTs and

• SAp: per-set BHT, per-address PHTs.

The denotation of the two-level adaptive branch predictors
can be derived from the following table, which gives a simplified
estimation of the hardware costs (Yeh and Patt, 1993):



Scheme          BHR length   No. of PHTs   Hardware cost

GAg(k)          k            1             k + 2^k x 2
GAs(k, p)       k            p             k + p x 2^k x 2
GAp(k)          k            b             k + b x 2^k x 2
PAg(k)          k            1             b x k + 2^k x 2
PAs(k, p)       k            p             b x k + p x 2^k x 2
PAp(k)          k            b             b x k + b x 2^k x 2
SAg(k)          k            1             s x k + 2^k x 2
SAs(k, s x p)   k            p             s x k + p x 2^k x 2
SAp(k)          k            b             s x k + b x 2^k x 2

In the table b is the number of PHTs or entries in the BHT for
the per-address schemes; p and s denote the number of PHTs
or entries in the BHT for the per-set schemes, assuming that
different per-set schemes are possible for BHR selection and
for PHT selection.
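The cost estimates can be evaluated with a one-line helper. This sketch assumes that every row of the table follows the same pattern, namely the number of BHRs times k history bits plus 2 bits for each of the 2^k entries in every PHT; the function name cost_bits and its parameter names are hypothetical.

```python
def cost_bits(num_bhrs: int, k: int, num_phts: int) -> int:
    """Simplified storage cost in bits of a two-level adaptive predictor."""
    bhr_bits = num_bhrs * k                  # first level: history registers
    pht_bits = num_phts * (2 ** k) * 2       # second level: 2-bit counters
    return bhr_bits + pht_bits


print(cost_bits(num_bhrs=1, k=4, num_phts=1))    # GAg(4): 36
print(cost_bits(num_bhrs=1, k=7, num_phts=32))   # GAs(7, 32): 8199
```

For the global schemes the history register is negligible; almost all of the cost comes from the pattern history tables, which double in size with every additional history bit.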

The simulations of Yeh and Patt (1993) using the
SPEC89 benchmarks show that the performance of the
global history schemes is sensitive to the branch history
length. Interference of different branches that are
mapped to the same PHT is decreased by lengthening
the global BHR, leading to better prediction accuracy.
Similarly, adding PHTs reduces the possibility of pattern
history interference by mapping interfering branches into
different tables.

In general, the global history schemes are better than
the per-address schemes for the integer SPEC89 programs,
while the per-address schemes are better for the floating-point
intensive programs. The phenomenon is due to the ability
of the global history schemes to utilise branch correlation,
which is often the case in the frequent if-then-else statements
in integer programs, while the per-address schemes are better
in predicting loop-control branches which are frequent in
the floating-point SPEC89 benchmark programs. The per-set
history schemes are in between other schemes.

Comparing the cost effectiveness of the different schemes
using the formulas in the table given above and a fixed
hardware budget of 8K bits, the most cost-effective global
history scheme is GAs(7, 32), the best per-address scheme
is PAs(6, 16) and for per-set schemes SAs(6, 4 x 16) scores
best. From these three configurations PAs(6, 16) achieves the
highest average prediction accuracy.

When given a higher hardware budget of 128K bits, the
most cost-effective global history scheme is GAs(13, 32),
the best per-address scheme is PAs(8, 256) and the best
per-set scheme is SAs(9, 4 x 32). Of these configurations
GAs(13, 32) achieves the highest measured prediction
accuracy of 97.2%.

Yeh and Patt conclude that global history schemes perform
better than other schemes on integer-dominated programs but
require higher implementation costs to be effective overall.
However, in the global history schemes, the pattern history
of different branches interfere with each other if they map
to the same PHT. Therefore, long BHRs and/or many PHTs
should be used.

Per-address history schemes perform better than other
schemes on floating-point programs. Per-set history
schemes have a performance that is similar to global
history schemes on integer programs and similar to
per-address schemes on floating-point intensive programs.

3.5 gselect and gshare predictors

McFarling (1993) analysed the 2-bit predictors and
correlation-based predictor schemes and introduced a
number of new predictors. One set of the correlation-based
predictors uses a hash function into the PHT instead of
indexing the PHT to reduce conflicts.

Recall that in the correlation-based predictor scheme the
(2, 2)-predictor shown in Figure 5 requires 12 bits for a
PHT table lookup (assuming a single, unified PHT instead
of the four PHTs); 2 bits from the BHR are concatenated
with 10 bits from the branch address. McFarling calls this
bit concatenation in a correlation-based or GAs predictor the
gselect predictor, which concatenates some lower order bits
of the branch address and of the bit pattern in the BHR.

In contrast to simple indexing, McFarling's gshare
predictor uses the bitwise exclusive OR of part of the branch
address and the BHR as a hash function. To demonstrate the
ability of both predictor types, McFarling uses the following:



Branch address   BHR        gselect 4/4   gshare 8/8

00000000         00000001   00000001      00000001
00000000         00000000   00000000      00000000
11111111         00000000   11110000      11111111
11111111         10000000   11110000      01111111



Strategy gselect 4/4 concatenates the lower order 4 bits of
the branch address with the lower order 4 bits of the BHR.
Strategy gshare 8/8 uses the bitwise XOR of all 8 bits of
both the branch address and the BHR. Comparing gshare 8/8
and gselect 4/4 shows that only gshare is able to separate all
four cases. The gselect predictor cannot take advantage of
the distinguishing history in the upper 4 bits of the BHR.
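Both index functions are a single line of bit manipulation, so the four cases can be recomputed directly. The function names gselect_4_4 and gshare_8_8 are hypothetical labels for the two strategies; the input values are the four branch address and BHR pairs from the example.

```python
def gselect_4_4(branch_addr: int, bhr: int) -> int:
    """Concatenate the low 4 address bits with the low 4 history bits."""
    return ((branch_addr & 0xF) << 4) | (bhr & 0xF)


def gshare_8_8(branch_addr: int, bhr: int) -> int:
    """XOR all 8 address bits with all 8 history bits."""
    return (branch_addr ^ bhr) & 0xFF


cases = [
    (0b00000000, 0b00000001),
    (0b00000000, 0b00000000),
    (0b11111111, 0b00000000),
    (0b11111111, 0b10000000),
]
for addr, bhr in cases:
    print(f"{addr:08b} {bhr:08b} "
          f"{gselect_4_4(addr, bhr):08b} {gshare_8_8(addr, bhr):08b}")

print(len({gselect_4_4(a, h) for a, h in cases}))  # 3 distinct indices
print(len({gshare_8_8(a, h) for a, h in cases}))   # 4 distinct indices
```

The last two cases collide under gselect 4/4 because they differ only in the upper history bits, which the concatenation discards; the XOR keeps them apart.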

3.6 Hybrid predictors

The second strategy proposed by McFarling (1993) is to
combine multiple separate branch predictors, each tuned to
a different class of branches. Different branch prediction
schemes have different advantages. Hopefully, such a
combining predictor achieves an even better prediction
accuracy than either of the predictors used for combination.
To predict a given branch, typically two or more predictors
and a predictor selection mechanism are necessary in a
combining predictor. In principle, all kinds of branch
predictors are candidates for combination of predictors.

McFarling combined the 2-bit predictor with the
gshare two-level adaptive predictor and concluded that,
in this combination, global information can be used if it
is worthwhile; otherwise, the usual branch direction as
predicted by the 2-bit predictor can be used. Another
combination proposed by the same author is the combination
of a PAp predictor with the gshare scheme. Simulations
with SPEC89 benchmarks showed that both hybrid predictors
outperform the gshare, which itself is better than gselect and
all other predictors for a given counter array size.

Another kind of hybrid predictor proposed by Young
and Smith (IJW) combines a compiler-based static branch
prediction with a dynamic predictor of the two-level adaptive
type. Profiling is used to collect the static prediction
information {CJht « fll-, II^T).

GnmwoJd el al, (1998) compared tlie SAg. gxitam 
and McFart ing's combining predictor (conibanirtg a two-bit 
predictor with tine ^.sWcr pctdfcu*) using <he SPECtmSl'5 
ben^m^rl^. The results are reported in the Table I. The 
<ablc shows that for SPECi«t95 bencrtraarfc programmes 
about efvecy sixth in$(ruetcan of ihe tiacc (the executed 
and committed instructions) ts a branch instruction and 
3n die n*eart inisj^edkciocl ra*a ihe «n0wriing predictor 
performs best with S,l% mispredictions. Further sirauladoo 
showed that the proee&sot typically i$$ued 10-100% more 
instructions than actually commit* due to speculative 
execution (Grunwald ot al. r I9°8i). 

Other simulations by Keeton et al. (1998) using an
OLTP (online transaction workload) on a Pentium Pro
multiprocessor reported a misprediction ratio of 14% with a
branch instruction frequency of about 21%. The speculative
execution factor, given by the number of instructions decoded
divided by the number of instructions committed, is 1.4 for
the database programs.

Two different conclusions may be drawn from these
simulation results: branch predictors should be improved
further and/or branch prediction is only effective if the
branch is predictable. If a branch outcome is dependent
on irregular data inputs, as is often the case in OLTP
applications or game-playing programs, the branch often
shows an irregular behaviour. This may be the reason for
the high misprediction rate of the SPECint95 benchmark
program go.

Numerous other selector and hybrid predictor types are
evaluated and reported in the research literature.

3.7 Confidence estimation

If a branch is not, or is not easily, predictable, its irregular
behaviour will frequently yield costly misspeculations. The
predictability of branches can be assessed by additionally
measuring the confidence in the prediction. A low confidence
branch is a branch which frequently changes its branch
direction in an irregular way, making its outcome hard to
predict or even unpredictable.

Confidence estimation is a technique for assessing the
quality of a particular prediction. If applied to branch
prediction, a confidence estimator attempts to assess the
prediction made by a branch predictor. Because each
branch is eventually determined to have been predicted
correctly or incorrectly, the confidence estimator assigns a
'high confidence' or a 'low confidence' to each prediction.
In combination with the two prediction outcomes 'correctly
predicted' and 'incorrectly predicted', four confidence
classes can be measured:

1 correctly predicted with high confidence

2 correctly predicted with low confidence

3 incorrectly predicted with high confidence and

4 incorrectly predicted with low confidence.

Table 1

Application  Committed     [illeg.]  [illeg.]  Misprediction rate
             instructions                      SAg       gshare    Combining
[illeg.]     [row largely illegible; surviving values: 54.6, 9.9]
gcc          [illeg.]      [illeg.]  49.0      [illeg.]  23.9      12.2
perl         228.2         43.8      [illeg.]  9.2       [illeg.]  [illeg.]
go           [illeg.]      [illeg.]  34.5      25.6      [illeg.]  [illeg.]
[illeg.]     [illeg.]      [illeg.]  71.7      4.7       [illeg.]  4.7
xlisp        [illeg.]      [illeg.]  39.5      [illeg.]  [illeg.]  [illeg.]
vortex       [illeg.]      [illeg.]  [illeg.]  [illeg.]  [illeg.]  [illeg.]
[illeg.]     252.0         20.0      70.0      [illeg.]  [illeg.]  [illeg.]
mean         267.6         [illeg.]  [illeg.]  [illeg.]  [illeg.]  8.1




When a branch is actually resolved, the branch can be
classified as belonging to one of these classes (Grunwald
et al., 1998).

To implement a confidence estimator, information from
the branch prediction tables is used. Already in the early 1980s,
Smith (1981) proposed to use saturation counter information
to construct a confidence estimator. The concept was to
speculate more aggressively when the confidence level is
higher (Smith, 1998). Jacobsen et al. (1996) used a miss
distance counter table in addition to the branch predictor.
Each time a branch is predicted, the value in the table is
compared to a threshold. If the value is above the threshold,
then the branch is considered to have high confidence, and
low confidence otherwise. Tyson et al. (1997) observed that
a small number of branch history patterns typically leads
to correct predictions in a PAs predictor scheme. Their
confidence estimator assigned high confidence to a fixed set
of patterns and low confidence to all others (Grunwald et al.,
1998).
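
A threshold-based estimator of this kind, together with the tally of the four confidence classes, might look like the sketch below. The counter policy (count consecutive correct predictions, reset on a misprediction), the indexing by branch address, the table size and the threshold value are all assumptions made for illustration; the names are hypothetical.

```python
from collections import Counter


class MissDistanceEstimator:
    """Table of counters that track how long ago a branch was mispredicted."""

    def __init__(self, entries=1024, threshold=3, max_count=15):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count
        self.table = [0] * entries

    def is_high_confidence(self, pc):
        # high confidence only if the counter is above the threshold
        return self.table[pc % self.entries] > self.threshold

    def update(self, pc, correct):
        i = pc % self.entries
        if correct:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = 0  # a misprediction resets the distance


def tally(estimator, events):
    """Count the four classes; events are (pc, prediction_was_correct)."""
    classes = Counter()
    for pc, correct in events:
        confidence = "high" if estimator.is_high_confidence(pc) else "low"
        outcome = "correct" if correct else "incorrect"
        classes[(outcome, confidence)] += 1
        estimator.update(pc, correct)
    return classes
```

Feeding the resolved branches of a trace through `tally` yields the number of predictions in each of the four classes listed above.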

Confidence estimation can be used for speculation control,
provided that ways other than branch speculation can be
used to utilise the processor resources. Such alternative
ways can be, for example, thread switching in multithreaded
processors (Ungerer et al., 2002, 2003) or multipath
execution where instructions from both branch directions
are fetched and executed, and the wrong path instructions
are afterwards discarded. In a simultaneous multithreaded
processor (Ungerer et al., 2002, 2003), it may be more
cost effective to switch threads than speculatively evaluate
a branch of low confidence. In a multipath execution model
both branch paths of a low confidence branch may be
evaluated, whereas a conventional branch speculation may be
employed to high confidence branches. Both techniques need
the ability of a processor to pursue two different instruction
streams simultaneously. Because of the limitation of a single
instruction pointer in today's superscalar processors, such
techniques are confined to multithreaded processors and
related processor techniques such as multiscalar and trace
processors (Šilc et al., 2000).



3.8 Predicated instructions

One technique that allows us to 'navigate' two branch paths in
a multiple-issue processor is predication ([...], 1997,
1993; Hwu, 1998; Mahlke et al., 1995). Using this technique,
the instruction set architecture of a processor is enhanced by
so-called predicated or conditional instructions and one or
more predicate registers. The Boolean result of a condition
test is recorded in a (1-bit) predicate register. Predicated
instructions use a predicate register as an additional input
operand.

Predication is demonstrated by the following source code
sequence:

if (x == 0) {          /* branch b1 */
    a = b + c;
    d = e - f;
}
g = h * i;             /* instruction independent of branch b1 */

Translation of the example source code sequence, using
a branch instruction for the alternative, would lead to a
speculative execution with instruction g = h * i and all later
instructions on the speculative path of branch b1. In the case
of a misspeculation, temporary results of this and all later
instructions would be unnecessarily discarded. However, the
source code is translated in the following code sequence using
predicated instructions (each line represents a single machine
operation):

(Pred = (x == 0))
if Pred then a = b + c;
if Pred then d = e - f;

As can be seen from the example, predication is able to
eliminate a branch and, therefore, the associated branch
prediction, increasing the distance between mispredictions.
Also the run length of a code block is increased, which allows
better instruction scheduling by an optimising compiler.
However, the compiler must ensure that the exception
behaviour is not changed by moving the instruction across
a set-predicate instruction.

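
The equivalence of the branching and the predicated form of the example can be checked with a small model. This sketch only mirrors the semantics (a guarded operation leaves its destination unchanged when the predicate is false); it says nothing about how a real pipeline executes predicated instructions. The function names and the default initial values of `a` and `d` are assumptions.

```python
def branching(x, b, c, e, f, h, i, a=0, d=0):
    """Source form: the two assignments are guarded by branch b1."""
    if x == 0:
        a = b + c
        d = e - f
    g = h * i  # independent of branch b1
    return a, d, g


def predicated(x, b, c, e, f, h, i, a=0, d=0):
    """Predicated form: one predicate, then guarded operations."""
    pred = (x == 0)           # Pred = (x == 0)
    a = b + c if pred else a  # if Pred then a = b + c
    d = e - f if pred else d  # if Pred then d = e - f
    g = h * i
    return a, d, g
```

Both functions return the same triple for any input, which is the property a compiler must preserve when it replaces the branch by predicated instructions.
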
Predication affects the instruction set, adds a port to
the register file and complicates instruction execution.
Predication is most effective when control dependencies can
be completely eliminated, such as in an if-then with a small
then body and when the condition can be evaluated early. The
use of predicated instructions is limited when the control flow
involves more than a simple alternative sequence. Moreover,
predicated instructions that are discarded still consume
processor resources; the fetch bandwidth is especially
affected.

If the full instruction set is predicated (a so-called full
predication model), predication bits in the opcode are
additionally needed for each instruction to denote a predicate
register. Thus, often, only a few instructions of the instruction
set architecture, in most cases the load instructions, are
predicated instructions.

Most signal processors, high-performance
microcontrollers and some contemporary superscalar
processors employ predication. As examples, the ARM
processor and the Intel Merced are fully predicated,
while Alpha, MIPS, PowerPC and SPARC processors use
conditional move instructions.

Predicated instructions are fetched, decoded and placed
in the instruction window like non-predicated instructions.
It depends on the processor architecture how far a predicated
instruction proceeds speculatively in the pipeline before its
predication is resolved:

• A predicated instruction executes only if its predicate
is true, otherwise the instruction is discarded. In this
case predicated instructions are not executed before
the predicate is resolved.

• Alternatively, as reported for Intel's IA-64 instruction
set architecture, the predicated instruction may be
executed, but commits only if the predicate is true,
otherwise the result is discarded (Dulong, 1998).

The latter case is similar to the eager or multipath execution
model described below.

3.9 Eager execution

With the eager or multipath execution model, execution
proceeds down both paths of a branch and no prediction
is made. When a branch resolves, all operations on the
non-taken path are discarded. Consequently, eager execution
with unlimited resources, which can be characterised as
'oracle execution', would give the same theoretical maximum
performance as a perfect branch prediction. With limited
resources, the eager execution strategy must be employed
carefully. Resource consumption rises exponentially with
each level of branches that are executed eagerly. Therefore,
instead of employing full eager execution, a mechanism is
required that decides when to employ prediction and when
eager execution.

One decision mechanism is the use of a confidence
estimator. If a branch prediction can be made with high
confidence, branch prediction and single path speculative
execution is employed; when low confidence is the case,
eager execution spares the misprediction penalty.

Until now, the eager execution strategy has rarely
been implemented, except for limited applications, such
as instruction fetch in the SuperSPARC processor and in
the IBM [...] (Uht et al., 1997) and subsequent IBM
mainframes, for example, the IBM 3090 processor.

The 'nanothreaded' DanSoft processor implements a
multipath-execution model using confidence information
from a static branch prediction mechanism (Gwennap, 1997).

A number of research projects have surveyed
eager execution. The PolyPath architecture (Klauser et al.,
1998) enhances a superscalar processor by a limited
multipath execution feature to employ eager execution. Heil
and Smith (1996) propose selective dual path execution and
Tyson et al. (1997) propose a limited dual path execution.
Wallace et al. survey threaded multipath execution,
employing eager execution in a simultaneous multithreaded
processor model.

Unger et al. (1998) propose a compiler technique
called simultaneous speculation scheduling in combination
with a 'minimal' multithreaded execution model to enable
speculative execution of alternative program paths. The
technique is only applicable for architectures that fulfil
certain requirements of a basic multithreaded processor
model:

• First, the processor must be able to pursue two or more
threads of control concurrently, that is, it must provide
two or more independent program counters.

• All concurrently executed threads of control share the
same address space, preferably the same register set.

• The instruction set must provide a number of
thread-handling instructions: here the minimal
requirements for multithreading are an instruction for
creating a new thread (fork) and an instruction that
conditionally stops its own execution or the execution
of some other threads (sync).

• Creating a new thread by the fork instruction and
joining threads by the sync instruction must be
extremely fast, preferably single-cycle operations.

Uht and Sindagi (1995) propose the disjoint eager execution
technique. The idea is to assign resources to branch paths
whose results are most likely to be used, that is, branches
with the highest cumulative execution probability. Uht
and Sindagi's notion of branch execution probability is
closely related to the confidence in a branch prediction, for
which they use the branch prediction accuracy, that is, the
percentage of taken or not taken executions of a branch.

While a branch path is speculatively executed, further
branches may be encountered before the first branch resolves,
often resulting in a branch speculation level of four or
more. The cumulative execution probability accumulates
the prediction accuracies of a branch and of the pending
(predicted but yet to be resolved) branches of previous
speculation levels. If all branches in such a sequence of
pending branches are simply assumed to be independent of
each other, the single prediction accuracies can be multiplied
to determine the cumulative execution probability of the last
branch in the sequence.
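
Written out, and using notation that is introduced here only for illustration (p_k for the prediction accuracy of the pending branch at speculation level k, P_n for the cumulative execution probability of the predicted path at level n), the independence assumption gives:

```latex
P_n = \prod_{k=1}^{n} p_k ,
\qquad
\bar{P}_n = (1 - p_n) \prod_{k=1}^{n-1} p_k
```

Here \bar{P}_n is the probability of the path not chosen by the prediction at level n. With p_k = 0.7 for every branch, the predicted paths have probabilities 0.7, 0.49, 0.343 and 0.2401 at levels one to four, while the non-predicted paths at levels one and two have probabilities 0.3 and 0.21.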

Thus in the disjoint eager execution model, all branches
are predicted, the cumulative prediction accuracy is
computed and compared to the accuracies of all branch paths
that were yet to be chosen for speculative execution. The
branch path with the highest cumulative prediction accuracy
is executed, leading to either another single path speculative
execution or an eager execution.
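
The selection rule can be sketched as a greedy allocation over a priority queue. The model below is a simplification under stated assumptions: every path that receives resources immediately exposes one further pending branch, all branches have the same prediction accuracy and are independent, and ties are broken by insertion order. The function name is hypothetical.

```python
import heapq


def disjoint_eager_order(accuracy, resources):
    """Cumulative probabilities of the paths, in resource-assignment order."""
    # heap entries: (-cumulative probability, tie-breaker, probability)
    heap = [(-accuracy, 0, accuracy), (-(1 - accuracy), 1, 1 - accuracy)]
    heapq.heapify(heap)
    order, next_id = [], 2
    while heap and len(order) < resources:
        _, _, prob = heapq.heappop(heap)  # most likely unassigned path
        order.append(prob)
        # the chosen path reaches a new branch: predicted and other side
        for child in (prob * accuracy, prob * (1 - accuracy)):
            heapq.heappush(heap, (-child, next_id, child))
            next_id += 1
    return order
```

With an accuracy of 0.7 and six resources, the paths are assigned in the order 0.7, 0.49, 0.343, 0.3, 0.2401, 0.21: the first three assignments follow the predicted path, and the fourth goes to the non-predicted side of the first branch because 0.3 exceeds 0.2401.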

The three different possibilities of single path speculative
execution as produced by the usual speculation methods
described above, full eager execution and disjoint eager
execution are demonstrated in Figure 13 (Uht et al., 1997).
Each line with an arrow represents a branch path marked by
its cumulative probability. For illustration, branch prediction
accuracy is 70% for each individual branch. All branches are
pending. Branch paths with circled numbers are in execution;
branch paths that are not chosen by the prediction are the
paths without circled numbers. Circled numbers indicate the
order of the resource assignment, that is, the order in which
the paths are speculatively assigned. Figure 13(c) shows that
the disjoint eager execution strategy allocates resources to
more likely branch paths than the single path and the eager
execution models.

Figure 13 (a) Single path speculative execution, (b) full eager
execution and (c) disjoint eager execution




3.10 Prediction of indirect branches

All branch prediction techniques reported above are directed
towards prediction of direct branches, whose targets
are encoded in the instruction itself. Indirect branches,
which transfer control to an address stored in a register,
are even harder to predict accurately. Though indirect
branches are not as frequent as direct branches in C- or
FORTRAN-benchmark programs, indirect branches occur
with higher frequency in machine code compiled from
object-oriented programs like C++ and Java. Virtual
function tables, used in C++ and Java compilers to implement
late binding of subroutine invocations, execute an indirect
branch for every polymorphic call. A simple BTB is a poor
predictor for branches with changing targets. One simple
possibility is to update the PHT to include the branch target
addresses.

Driesen and Hölzle (1998) reported an indirect branch
frequency of once every 50 instructions for several
large object-oriented C++ programs. They investigated
two-level and hybrid indirect branch predictors and reported
a misprediction rate of 10% with a 1 k-entry table, 7% with
an 8 k-entry table, 9% for a 1 k-entry hybrid predictor and 6%
in the 8 k-entry hybrid predictor case.
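
The difference between a simple BTB and a two-level indirect predictor can be illustrated with a toy model. This is a sketch under assumptions, not the predictors evaluated in the cited study: the tables are unbounded dictionaries rather than fixed-size hashed tables, the first level is a short path of recent targets, and all class and function names are hypothetical.

```python
class LastTargetBTB:
    """Predicts that an indirect branch jumps to its previous target."""

    def __init__(self):
        self.table = {}

    def predict(self, pc):
        return self.table.get(pc)

    def update(self, pc, target):
        self.table[pc] = target


class TwoLevelIndirectPredictor:
    """Second-level table indexed by the branch and the recent target path."""

    def __init__(self, path_length=2):
        self.path_length = path_length
        self.path = ()   # first level: most recent indirect targets
        self.table = {}  # second level: (pc, path) -> target

    def predict(self, pc):
        return self.table.get((pc, self.path))

    def update(self, pc, target):
        self.table[(pc, self.path)] = target
        self.path = (self.path + (target,))[-self.path_length:]


def count_hits(predictor, pc, targets):
    """Number of correctly predicted targets for one branch."""
    hits = 0
    for target in targets:
        hits += predictor.predict(pc) == target
        predictor.update(pc, target)
    return hits
```

For a polymorphic call site whose target cycles through three methods, the last-target BTB never predicts correctly, whereas the two-level predictor is correct as soon as each two-target path has been seen once.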

3.11 High-bandwidth branch prediction

Future microprocessors will require more than one prediction
per cycle, starting speculation over multiple branches in a
single cycle. Here the GAg scheme is able to predict multiple
branches without knowing the branch instruction address.
However, the instruction fetch is also affected. When multiple
branches are predicted per cycle, instructions must be fetched
from multiple target addresses per cycle, complicating
I-cache access. A trace cache (Rotenberg et al., 1996) in
combination with next trace prediction is able to solve both
problems by fetching from a dynamically assembled trace
line, rather than from I-cache.

A combination of branch handling techniques will most
likely be applied, such as a multihybrid branch predictor
(Evers et al., 1996; Patt et al., 1997) combined with
support for context switching, indirect jumps and interference
handling. As already emphasised, a fast and accurate branch
prediction is essential for advanced superscalar processors
with hundreds of in-flight instructions. Branch prediction
itself is already a well-developed part of microarchitecture
design. One observation is that many branches display
different characteristics that cannot be optimally predicted by
a single-scheme branch predictor. Evers et al. (1996) propose
hybrid branch predictors, a technique that was previously
proposed by combining two predictors (McFarling, 1993)
and that is already implemented in the PowerPC 620.
Hybrid predictors comprise several predictors, each targeting
different classes of branches. The principal idea is that
each predictor scheme works best for a particular branch
type.

As predictor tables increase in size, they often take
more time to react to changes in a program (warm-up
time). A hybrid predictor with several components can solve
this problem by using component predictors with shorter
warm-up times while the larger predictors are warming up.
Examples of predictors with short warm-up times are
two-level predictors with shorter histories as well as smaller
dynamic predictors (Patt et al., 1997).

The multihybrid branch predictor (Evers et al., 1996;
Patt et al., 1997) uses a set of selection counters for each
entry in the branch target buffer, in the trace cache or in
a similar structure, keeping track of the predictor currently
most accurate for each branch and then using the prediction
from that predictor for that branch. The multihybrid predictor
performs better than regular hybrid predictors. It reaches
a prediction rate of 95% for 16 kB predictor size and up
to near 97% for 256 kB predictors using programs of the
SPECint95 benchmark suite (Patt et al., 1997). Despite this
high prediction rate, the remaining mispredictions still incur
a large performance penalty. Other branch techniques must
be combined with branch prediction. Such techniques are
predication to enlarge the number of instructions between
two speculative predictions or both-path execution (as in the
PolyPath architecture by Klauser et al. (1998)) in the case of
low branch prediction confidence (Grunwald et al., 1998).

3.12 Neural branch predictors

The first dynamic neural branch predictors were proposed
by Vintan and Iridon (1999). The neural branch predictor
research was consistently developed further by Jiménez and
Lin (2001, 2002) with the first perceptron predictor feasible
to be implemented in hardware. The main advantage of
the neural predictor consists in its ability to exploit long
histories requiring linear resource growth, while classical
predictors require exponential resource growth. The
main disadvantage of the perceptron predictor consists in its
high latency.
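
A perceptron predictor in the spirit of Jiménez and Lin can be sketched in a few lines. The sketch makes several assumptions: weights are unbounded Python integers rather than saturating hardware counters, the weight row is selected by the branch address modulo the table size, and the training threshold uses the value floor(1.93 h + 14) for history length h, which is the rule of thumb recalled here from their paper and should be checked against it.

```python
class PerceptronPredictor:
    """One weight vector per table row; inputs are +1/-1 history bits."""

    def __init__(self, history_length=8, rows=64):
        self.rows = rows
        self.theta = int(1.93 * history_length + 14)  # training threshold
        # storage grows linearly with the history length: h + 1 weights
        self.weights = [[0] * (history_length + 1) for _ in range(rows)]
        self.history = [-1] * history_length  # +1 taken, -1 not taken

    def _output(self, pc):
        w = self.weights[pc % self.rows]
        # bias weight plus dot product of weights and history bits
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # train on a misprediction or when the output is not confident
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % self.rows]
            w[0] += t
            for i, xi in enumerate(self.history, start=1):
                w[i] += t * xi
        self.history = self.history[1:] + [t]
```

Each table row holds h + 1 weights, so doubling the history length doubles the storage, whereas a pattern history table indexed by h history bits needs 2^h counters. The dot product over all history bits is also the source of the latency mentioned above.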

In order to reduce the prediction latency, a fast-path neural
predictor was proposed (Jiménez, 2003). Here, a perceptron
predictor chooses its weights for generating a prediction
according to the current branch's path, rather than according
to the branch's program counter.

Many studies have extended the perceptron predictor.
Loh and Henry (2002) use the perceptron predictor as a
component of a larger hybrid predictor. Thomas et al. (2003)
find salient history bits for the perceptron predictor using
dynamic data-flow analysis. Akkary et al. (2004) adapt the
perceptron predictor to provide confidence estimates for
speculation control. Falcón et al. (2004) use a perceptron
predictor as a component of a prophet/critic hybrid predictor
that can criticise previous
predictions and possibly reverse them.

The neural branch predictor concept is very promising and
Intel has already implemented this idea in one of the Itanium's
simulators for researching future microarchitectures.

4 Conclusions

When a branch is not predicted correctly, there is rarely
a penalty of less than two cycles, even in simple RISC
pipelines. However, the misprediction penalty depends on
many organisational features: the pipeline length (favouring
shorter over longer pipelines), the overall organisation of the
pipeline, whether misspeculated instructions can be removed
from internal buffers or have to be executed and can only
be removed in the retire stage. Further dynamic aspects
that influence the misprediction penalty are the number of
speculative instructions in the instruction window or the
reorder buffer. Typically only a limited number of instructions
can be removed each cycle. Therefore, rerolling when a
branch is mispredicted is typically expensive, for example,
11 or more cycles in the Pentium II or the Alpha 21264
processors. The high misprediction penalty in current and
prospective future microprocessors shows the importance
of excellent branch prediction mechanisms for the overall
performance of a processor.

Besides static and dynamic branch prediction there are
other techniques to handle branches, for example,
predication using so-called predicated or conditional
instructions that allow the removal of the branch from the
instruction flow and eager execution of both branch sides.
Eager execution is especially effective when the branch
direction changes in an irregular fashion, which means the
branch is not predictable. In that case the expensive rerolling
mechanism slows down execution. However, eager execution
is not possible with today's superscalar processors because
the ability to pursue two instruction streams in parallel is
necessary.

Notes

1 A variation of the BTB that was popular for older processors
without on-chip I-caches is to store one or more target instructions
additionally to the target address.

2 Pan et al. (1992) used the term 'branch prediction table' instead
of 'PHT' and 'm-bit shift register' instead of 'BHR'.

3 The GAg scheme is called the 'degenerate case' of the correlation
scheme by Pan et al. (1992).

4 The PAp predictor is mainly of theoretical interest, because the
variable numbers of BHRs and PHTs cause implementation
problems.

5 Prediction accuracy measured for SPECint95 or OLTP (online
transaction processing) programs is much lower than for
SPEC89 benchmarks (see Table 1).

6 Called a bimodal predictor by McFarling (1993).

7 Called a local predictor by McFarling; per-address scheme in Yeh
and Patt's nomenclature.

I, Brian Robert Prasky, one of the inventors of USSN 10/809,749 filed 03/25/2004
relating to our invention of an Instruction Text Controlled Selectively Stated Branches for
Prediction via a Branch Target Buffer, declare with respect to this patent application that
I have read the Office Action of Examiner Robert E Fennema, dated October 30, 2007
and here say:

That I believe that the examiner errs and does not in the office action correctly state the
knowledge of one of ordinary skill in the art at the time the invention was made.

Furthermore, I say that one of ordinary skill in the art would have known that the IBM
604 and 620 Power PC as described in the book referenced in the attachment hereto as
printed from the Internet website of Google Books entitled "Modern Processor Design"
by John Paul Shen et al, and that as shown on page 230, a BTAC (BTB) is addressed at
instruction fetch and the BHT is addressed at instruction decode, where instruction
decode is after instruction fetch. A prediction from the BTAC (BTB) is defined as taken
and later on the BHT can override the instruction direction prediction of the BTB.

I have also appended a reference called Dynamic branch prediction and control
speculation from the International Journal High Performance Systems Architecture, Vol.
1, No. 1, 2007 which clarifies that the BTAC and BTB nomenclature refer to the Branch
Target Buffer.



Whereupon, I believe that anyone of ordinary skill in the art would disagree with the
examiner's conclusion that "Examiner asserts that no one of ordinary skill in the art
would ever come to this conclusion, and that the only thing that makes sense in the
context of this invention is to disable all components associated with branch prediction
when the user has explicitly indicated that branch prediction is to be disabled, and given
that a BTB has no purpose outside of branch prediction, and that it cannot work with
other branch prediction mechanisms disabled, that it would be disabled also", because
such a conclusion shows a clear misunderstanding of the workings of a BHT and BTB,




